High-resolution portrait stylization frameworks using a hierarchical variational encoder

ABSTRACT

Systems and methods directed to an inversion-consistent transfer learning framework for generating portrait stylization using only limited exemplars. In examples, an input image is received and encoded using a variational autoencoder to generate a latent vector. The latent vector may be provided to a generative adversarial network (GAN) generator to generate a stylized image. In examples, the variational autoencoder is trained using a plurality of images while keeping the weights of a pre-trained GAN generator fixed, where the pre-trained GAN generator acts as a decoder for the encoder. In other examples, a multi-path attribute-aware generator is trained using a plurality of exemplar images and transfer learning using the pre-trained GAN generator.

BACKGROUND

Portraiture, the art of depicting the appearance of a subject, is an important art form dating back to the beginning of civilization. It has evolved beyond faithful depiction into more creative interpretations with a plethora of styles, such as abstract art, Cubism and cartoon. Automatically stylized portraiture has undergone rapid progress in recent years due to advances in deep learning. Early methods involving neural style have convincingly demonstrated the ability to transfer textural styles from an exemplar source to target images, with real photos transformed into Van Gogh or Picasso paintings. However, when it comes to portraiture, these methods largely failed to capture the important geometry-dependent motifs of different portraiture styles, thus falling short in stylization quality.

Image-to-image translation methods were later introduced to “translate” images from a source domain to a target domain using paired datasets in a supervised manner or using unpaired datasets in an unsupervised setting. These methods have been explored for portrait stylization, e.g., self-to-anime and cartoon. However, supervised approaches require paired datasets for training that would be manually onerous if not infeasible, while the unsupervised approaches not only need a large amount of unpaired data, but also often face difficulties with stable training convergence and in generating high-resolution results. A recent portrait stylization pipeline, Toonify, builds on a pre-trained model of the high-resolution generative neural network StyleGAN2. Using a few hundred unpaired exemplars, Toonify generates promising results in cartoon style by employing transfer learning to adapt StyleGAN2 to the given style exemplars. When given an input image, the corresponding latent code is obtained by an optimization-based inversion in one of the StyleGAN2 latent spaces, which is then used to generate the stylized output via the adapted StyleGAN2 model. Despite its strong generalization ability given only limited exemplars, the stylization of real input images (in contrast to StyleGAN2 realistically synthesized ones) may include various artifacts, likely due, at least in part, to the sub-optimality of the inversion method used. That is, while Toonify's inverse mapping may work well for reconstructing real faces, it is not very robust to different styles.

It is with respect to these and other general considerations that embodiments have been described. Although relatively specific problems have been discussed, it should be understood that the examples described herein should not be limited to solving the specific problems identified in the background above.

SUMMARY

Portraiture as an art form has evolved from realistic depiction into a plethora of creative styles. While substantial progress has been made in automated stylization, generating high quality stylistic portraits is still a challenge, and even the recent popular Toonify stylization platform suffers from several artifacts when used on real input images. Such StyleGAN-based methods have focused on finding the best latent inversion mapping for reconstructing input images; however, this focus has not led to good generalization for different portrait styles. In accordance with examples of the present disclosure, an AgileGAN framework is proposed that generates high quality stylistic portraits via inversion-consistent transfer learning. The AgileGAN framework includes a hierarchical variational autoencoder; the hierarchical variational autoencoder generates an inverse mapped distribution that conforms to the original latent Gaussian distribution provided by a StyleGAN-based network, while augmenting the original latent space to a multi-resolution latent space so as to provide encoding for different levels of detail. To better capture attribute dependent stylization of facial features, the AgileGAN framework includes an attribute-aware generator; the attribute-aware generator may adopt an early stopping strategy to avoid overfitting small training datasets. Such an architecture provides greater agility in creating high quality and high resolution (e.g., 1024×1024) portrait stylization models. Further, such models can operate on a limited number of style exemplars (for example, around 100 exemplar images) and therefore can be trained in a shorter amount of time (e.g., about 1 hour). In accordance with examples described herein, enhanced portrait stylization and quality can be achieved when compared to previous state-of-the-art methods.
Further, such techniques may be applied to applications that include but are not limited to image editing, motion retargeting, pose, and video applications. Additional information about GAN networks, including StyleGAN-based networks and StyleGAN2, can be found in the following printed papers: “A Style-Based Generator Architecture for Generative Adversarial Networks” by T. Karras, S. Laine, and T. Aila, in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, and “Analyzing and Improving the Image Quality of StyleGAN” by T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, both of which are incorporated herein by reference, for all that they teach and all purposes.

In accordance with at least one example of the present disclosure, a method for generating a stylized image is described. The method may include receiving an input image; encoding the input image using a variational autoencoder to obtain a latent vector; providing the latent vector to a generative adversarial network (GAN) generator; generating, by the GAN generator, a stylized image from the GAN generator; and providing the stylized image as an output.

In accordance with at least one example of the present disclosure, a system for generating a stylized image is described. The system may include a processor; and memory including instructions, which when executed by the processor, cause the processor to: receive an input image; encode the input image using a variational autoencoder to obtain a latent vector; provide the latent vector to a generative adversarial network (GAN) generator; generate, by the GAN generator, a stylized image from the GAN generator; and provide the stylized image as an output.

In accordance with at least one example of the present disclosure, a computer-readable storage medium including instructions is described. The instructions, which when executed by a processor, cause the processor to: receive an input image; encode the input image using a variational autoencoder to obtain a latent vector; provide the latent vector to a generative adversarial network (GAN) generator; generate, by the GAN generator, a stylized image from the GAN generator; and provide the stylized image as an output.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting and non-exhaustive examples are described with reference to the following Figures.

FIG. 1 depicts an example of a t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization of latent code distributions for different inversion methods and the relation to stylized image quality in accordance with examples of the present disclosure.

FIG. 2 depicts aspects of the stylized training and stylized image generation system in accordance with examples of the present disclosure.

FIG. 3 depicts additional details of the stylized training and conversion server in accordance with examples of the present disclosure.

FIG. 4 depicts details associated with training a hierarchical variational autoencoder (hVAE) in accordance with examples of the present disclosure.

FIG. 5 depicts additional details of the hierarchical variational autoencoder in accordance with examples of the present disclosure.

FIG. 6 depicts details of an attribute-aware generator in accordance with examples of the present disclosure.

FIG. 7 depicts additional details directed to training the attribute-aware generator in accordance with examples of the present disclosure.

FIG. 8 depicts details directed to the inference process of the trained AgileGAN model in accordance with examples of the present disclosure.

FIG. 9 depicts details of a method for training an AgileGAN model in accordance with examples of the present disclosure.

FIG. 10 depicts details of a method for generating a stylized image from an input image in accordance with examples of the present disclosure.

FIG. 11 depicts a block diagram illustrating physical components (e.g., hardware) of a computing system with which aspects of the disclosure may be practiced.

FIGS. 12A-12B illustrate a computing device with which embodiments of the disclosure may be practiced.

FIG. 13 illustrates one aspect of the architecture of a system for processing data.

DETAILED DESCRIPTION

In the following detailed description, references are made to the accompanying drawings that form a part hereof, and in which are shown, by way of illustration, specific embodiments or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the present disclosure. Embodiments may be practiced as methods, systems, or devices. Accordingly, embodiments may take the form of a hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims and their equivalents.

Stylizing facial images in an artistic manner has been explored in the context of non-photorealistic rendering. Early approaches relied on low level histogram matching using linear filters. Neural style transfer, by matching feature statistics in convolutional layers, led to early exciting results via deep learning. Since then, several improvements have been proposed, including enforcing local patterns in deep feature space via a Markov random field (MRF) and extending style transfer to video while improving quality by imposing temporal constraints. Although these methods can achieve generally compelling results for several artistic styles, they usually fail on styles involving significant geometric deformation of facial features, such as cartoonization. For more general stylization, image-to-image (I2I) translation may be used to translate an input image from a source domain to a target domain.

Conditional generative adversarial networks (GAN) may be implemented to learn the input-to-output mapping. Similar ideas have been applied to various tasks, such as sketches-to-photographs and attribute-to-images. However, these methods require paired training data, which is hard to obtain. To avoid this, conditional image generation has been used in an unsupervised manner. For example, the well-known cycle-consistency loss in CycleGAN has been proposed to improve network training stability for the unpaired setting. Unsupervised methods have also been used in cartoonization. Further, CycleGAN has been extended to cross-domain anime portrait generation, and other unsupervised methods have incorporated an attention module and a learnable normalization function for cartoon face generation, where their attention-guided model can flexibly control the amount of change in shape and texture. Although these methods can conduct plausible image translation, such networks require extensive training data, and thus most were trained for relatively low image resolutions.

Recently, a GAN interpolation framework for controllable cross-domain image synthesis, called Toonify, has been proposed to generate photo-realistic cartoonization. However, Toonify's inversion mapping when applied to real images may introduce undesired artifacts in the stylized output. In contrast, examples of the present disclosure utilize a variational autoencoder (VAE) inversion which enhances distribution consistency in latent space, leading to better results for real input images.

GANs have been used to synthesize images that ideally match the training dataset distribution via adversarial training. GANs have been applied to various areas, including but not limited to image inpainting, image manipulation, and texture synthesis. Various advancements have been made to improve the architecture, synthesis quality, and training stability of GANs. However, initial methods were mainly limited to low resolutions due to computational cost and shortage of high-quality training data. A high-quality human face dataset, CelebAMask-HQ, was collected, and a ProGAN architecture was proposed to train GANs for high resolution image generation via a progressive strategy. The ProGAN architecture generates realistic human faces at a high resolution of 1024×1024. Similarly, a high resolution human face dataset called Flickr-Faces-HQ (FFHQ) was collected and a generator architecture called StyleGAN was proposed, inspired by adaptive normalization for style transfer. StyleGAN further improves face synthesis quality to a level that is almost indistinguishable from real photographs. StyleGAN has been extended to StyleGAN2, which reduced artifacts and improved disentanglement using perceptual path length. Examples of the present disclosure build upon StyleGAN2 and leverage StyleGAN2's pre-trained weights as initialization.

Since GANs are typically designed to generate realistic images by sampling from a known distribution in latent space, GAN inversion addresses the complementary problem of finding the most accurate latent code, when given an input image, that will reconstruct that image. One approach is based on optimization, which directly optimizes the latent code to minimize the pixel-wise reconstruction loss for a single input instance. Another approach is learning-based, in which a deterministic model is trained by minimizing the difference between the input and synthesized images. Other works combine the optimization and learning-based approaches by learning an encoder that produces a good initialization for subsequent optimization. In addition to image reconstruction, some examples also use inversion when undertaking image manipulation. For example, a hybrid method may encode images into a semantic manipulable domain for image editing. In addition, a generic Pixel2Style2Pixel (PSP) encoder has been proposed; such an encoder is based on a dedicated identity loss for embedding images in several real image translation tasks, such as inpainting and super resolution. However, the processes used by the PSP encoder for single domain manipulation or reconstruction may not be directly applicable to cross-domain generation due to insufficient consistency in the latent distributions, which is addressed by the examples provided herein.
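The optimization-based inversion described above can be illustrated with a small sketch. It is a toy under stated assumptions and not any cited method: the "generator" is a hypothetical linear map (`generate`), the loss is a plain pixel-wise squared error, and the step size is chosen from a conservative bound so that plain gradient descent cannot diverge. All function names are illustrative.

```python
import random


def generate(weights, z):
    """Toy linear 'generator': image = weights @ z (a stand-in, not a real GAN)."""
    return [sum(w * v for w, v in zip(row, z)) for row in weights]


def reconstruction_loss(weights, z, target):
    """Pixel-wise squared error between the generated and target images."""
    return sum((g - t) ** 2 for g, t in zip(generate(weights, z), target))


def invert_by_optimization(weights, target, steps=500):
    """Gradient descent on the latent code z for a single target image."""
    dim = len(weights[0])
    # Conservative step size: 1 / (2 * ||W||_F^2) never exceeds 1 / Lipschitz constant.
    lr = 1.0 / (2.0 * sum(w * w for row in weights for w in row))
    z = [0.0] * dim
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(weights, z), target)]
        # Gradient of ||Wz - x||^2 with respect to z is 2 * W^T (Wz - x).
        grad = [2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
                for j in range(dim)]
        z = [v - lr * g for v, g in zip(z, grad)]
    return z


rng = random.Random(0)
pixels, latent_dim = 6, 3
weights = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(pixels)]
true_z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
target = generate(weights, true_z)  # an image known to lie in the generator's range

recovered = invert_by_optimization(weights, target)
initial_loss = reconstruction_loss(weights, [0.0] * latent_dim, target)
final_loss = reconstruction_loss(weights, recovered, target)
```

A learning-based approach would instead train an encoder once and amortize this per-image search across all inputs.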

Training a modern high-quality, high-resolution GAN typically requires a large number of images (e.g., 10⁵ to 10⁶), which is a costly undertaking in terms of acquisition, processing, and distribution. There are a few techniques to reduce such requirements. For example, a few-shot learning technique has been proposed to perform appearance translation without needing a large dataset of specific style translation pairs. However, a pre-trained style embedding network is required and the generated image resolution is limited. Alternatively, the idea of patch-based training has been explored, as less training data is needed when learning patch distributions. However, such techniques may not be readily applicable to portrait generation, since human faces have strong geometry semantics and may not simply be reduced to smaller patches for training. To address the issue of data shortage, examples presented herein are based on applying transfer learning to the StyleGAN-based architecture and adopting an early stopping strategy to generate optimal results.

As previously mentioned, finding the best inversion mapping in terms of reconstruction in the original StyleGAN2 is in fact misguided, because what may be best for realistic image generators may not be best for other stylized generators. Instead, a learned inversion mapping that also optimizes for matching the distribution of latent codes to the Gaussian latent distribution in the original StyleGAN2 may lead to better results across a range of different stylized generators. In other words, matching latent distributions when learning the inversion leads to robust embedding across different styles, and is better than aiming for the best reconstruction embedding for realistic images.

FIG. 1 depicts an example of a t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization 102 of latent code distributions for different inversion methods and the relation to stylized image quality in accordance with examples of the present disclosure. t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The original StyleGAN2 latent distribution is depicted as latent distribution 104. Stylizing an input image, such as the input image 106, using a model as described herein having a latent code distribution 108 that is aligned to or otherwise overlaps the original latent distribution 104 leads to more pleasing results. That is, a stylized portrait 110 may be generated from the input image 106 using a Hierarchical Variational Autoencoder (hVAE) as described herein according to the embodiments of the present disclosure, where the stylized portrait 110 is generated using a model having a latent code distribution 108 that is aligned to and/or overlaps the original StyleGAN2 latent distribution 104. The t-SNE visualization 102 also depicts other latent code distributions that may be used by other stylizing models when stylizing an input image 106. For example, the Toonify model may utilize the latent code distribution 112 when generating the stylized image 114. The latent code distribution 112 is not aligned to or otherwise does not overlap the original StyleGAN2 latent distribution 104. As another example, a PSP model may utilize the latent code distribution 116 when generating the stylized image 118. The latent code distribution 116 is not aligned to or otherwise does not overlap the original StyleGAN2 latent distribution 104. Similarly, an in-domain model may utilize the latent code distribution 120 when generating the stylized image 122. The latent code distribution 120 is not aligned to or otherwise does not overlap the original StyleGAN2 latent distribution 104.
Because the latent code distributions 112, 116, and 120 are not aligned to or otherwise do not overlap the original StyleGAN2 latent distribution 104, the inversion mapping when applied to real images as input often introduces undesired artifacts in the stylized output image. For example, geometric deformations of facial features may be visible in the stylized output image.

In accordance with examples of the present disclosure, AgileGAN, an inversion-consistent transfer learning framework for portrait stylization as described herein, includes a hierarchical variational autoencoder (hVAE) and an attribute-aware generator that works on a limited number of exemplars. Such a framework generates high quality and high resolution portrait stylization models in a variety of target styles. To achieve inversion consistency in the described AgileGAN framework, the hVAE is used to perform the inversion. Compared to other latent space inversion techniques that may operate on the less entangled latent space W, using the hVAE ensures that the mapping conforms to the multi-variate Gaussian distribution of the original GAN latent space, such as but not limited to that of a StyleGAN-based model. Furthermore, the hVAE is hierarchical in that the StyleGAN-based model's original Z latent space is augmented to a multi-resolution latent space Z+ to better encode different levels of detail in the image. Using the Z+ augmentation and hVAE significantly improves stylization quality.

To improve the training efficiency with a high resolution dataset, the training process is decomposed into two stages. In the first stage, the hVAE is trained for inversion encoding using the original StyleGAN-based model (e.g., StyleGAN2) as the decoder with fixed pre-trained weights. During such training, losses including the reconstruction loss, user identity loss, perceptual loss, and KL divergence loss are enforced for the VAE. In the second stage, latent codes are sampled in the Z+ space from a multi-variate Gaussian distribution; an attribute-aware generator is then fine-tuned starting from the StyleGAN-based model's (e.g., StyleGAN2) pre-trained weights. The training losses include an adversarial loss with the given style exemplars, a facial structural loss, as well as R1 and perceptual path-length regularization losses. The attribute-aware generator includes multiple generative paths for different attributes (e.g., hair color, hair length, etc.) and multiple discriminators to better capture attribute-dependent stylization of facial features. To avoid overfitting caused by a small training dataset, and to better balance identity and style, an early stopping strategy in training the StyleGAN-based model is adopted. During inference, the stylized output from an input image can be generated using the hVAE encoder and the attribute-aware generator.

FIG. 2 depicts aspects of the stylized training and stylized image generation system 200 in accordance with examples of the present disclosure. The stylized training and stylized image generation system 200 generally includes a computing device 204 communicatively coupled to a stylized training and conversion server 210 via a network 208. In examples, a user 202 may select a plurality of training images 206 and provide the plurality of training images 206 to the stylized training and conversion server 210 to train an hVAE. In addition, the user 202 may provide the plurality of exemplar images 207 to the stylized training and conversion server 210 to train a stylization model, such as an AgileGAN model 217, with a style and/or attribute exhibited by the plurality of exemplar images 207. For example, the plurality of exemplar images 207 may correspond to examples of cartoon characters, animals, etc. In some examples, the plurality of exemplar images 207 may be specific to a particular attribute that the user 202 would like enhanced or would otherwise prefer to see in the resulting stylized images. For example, the plurality of exemplar images 207 may exhibit one or more of a specific hair color, facial appearance, hair length, pose, lighting condition, etc. The stylized training and conversion server 210 may utilize transfer learning to train a pre-trained GAN model (e.g., StyleGAN2 and/or StyleGAN-based), and therefore a GAN generator 220, using the plurality of exemplar images 207. In some examples, following training, the stylized training and conversion server 210 may output a trained AgileGAN model including a trained hVAE 218 and generator 220. The hVAE 218 may be trained using a large quantity of high-quality images.
Alternatively, or in addition, the stylized training and conversion server 210 may receive one or more input images 212, generate one or more stylized images 214 based on the one or more input images 212, and provide the one or more stylized images 214 to the computing device 204 of the user 202. The one or more stylized images 214 may be displayed at a user interface 203 of the computing device 204.

FIG. 3 depicts details of the stylized training and conversion server 302 in accordance with examples of the present disclosure. More specifically, the stylized training and conversion server 302 may be the same as or similar to the stylized training and conversion server 210 previously discussed. The stylized training and conversion server 302 may include a communication interface 304, a processor 306, and a computer-readable storage 308. In examples, the communication interface 304 may be coupled to a network and receive the plurality of training images 325, the plurality of exemplar images 326, and one or more input images 324 for stylization. The image acquisition manager 316 may manage the acquisition of the images, and in some instances, may perform preprocessing of the images to ready them for training and/or stylization. The image 324 may be the same as or similar to the input image 212 (FIG. 2); the training images 325 may be the same as or similar to the training images 206 (FIG. 2); and the exemplar images 326 may be the same as or similar to the exemplar images 207 (FIG. 2). In some examples, one or more attribute selections may be received at the communication interface 304 and stored as an attribute selection 328. For example, an explicit attribute for hair color, etc. may be received as an attribute selection 328. While the image 324, training images 325, exemplar images 326, and attribute selection 328 are depicted as being input 312, other information and input may be received at the communication interface 304 and stored as input 312. For example, one or more model parameters (e.g., hyperparameters, model configurations, Z+ spaces, etc.) may be received at the communication interface 304 and stored as input 312.

The stylized training and conversion server 302 includes an AgileGAN training framework 317 for training the hVAE 318 and the attribute-aware generator 322. The AgileGAN training framework 317 may include a pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) 319 including a pre-trained GAN generator 320 (e.g., StyleGAN-based generator and/or StyleGAN2 generator). In examples, the hVAE 318 and the attribute-aware generator 322 may be trained independently of one another. Using the training images 325 together with the GAN generator 320, the hVAE 318 may be trained for inversion by learning the posterior distribution of the GAN model 319 using the fixed pre-trained GAN model 319 as a decoder. Given a small set of stylistic exemplars, for example those exemplar images stored as exemplar images 326, the AgileGAN training framework 317 can utilize transfer learning to train the attribute-aware generator 322 using the pre-trained GAN model 319 and the pre-trained GAN generator 320. Accordingly, the stylized training and conversion server 302 can output an AgileGAN framework 336 including a trained hVAE 338 and a trained attribute-aware generator 340 for generating stylized images from real portrait images. In one example, the trained attribute-aware generator 340 can be implemented by another device instead of the stylized training and conversion server 302 to perform the operation of generating stylized images from real portrait images. Alternatively, or in addition, the stylized training and conversion server 302 may receive an input of an image 324 and generate a stylized image 334. The stylized image 334 may be recognizable as the input subject's identity and may preserve the subject's pose and expression. In addition, the stylized image 334 may be rendered in a style that is consistent with the provided stylistic exemplars, such as the exemplar images 326.
In examples, the stylized training and conversion server 302 may perform both model training and stylized image generation, only model training, or only stylized image generation.

FIG. 4 depicts details associated with training a hierarchical variational autoencoder (hVAE) 404 of an AgileGAN framework in accordance with examples of the present disclosure. The AgileGAN framework may be the same as or similar to the AgileGAN framework 336 (FIG. 3). The arrows in FIG. 4 indicate dataflows associated with image embedding. Both a multi-layer perceptron (MLP) 414 and the GAN generator 418 (e.g., StyleGAN-based generator and/or StyleGAN2 generator) include block weights derived from a GAN model (e.g., StyleGAN-based generator and/or StyleGAN2 generator) pre-trained on a dataset; such weights are frozen during the training process provided in FIG. 4. The GAN generator 418 may be the same as or similar to the GAN generator 320 (FIG. 3); the hVAE 404 may be the same as or similar to the hVAE 318 (FIG. 3) and, once trained, the hVAE 338 (FIG. 3).

The starting baseline for training the AgileGAN framework is a pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model), such as the pre-trained GAN model 319 (FIG. 3). The pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) exhibits the property that if random samples from a Gaussian distribution in the Z latent space are acquired, the model can generate images fitting the original training distribution, for example, the original training distribution of the dataset. As previously mentioned, training the AgileGAN model may include two stages. Since the task of training involves using an image as input, a corresponding latent vector for the GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) is determined. A front-end encoder, such as the hVAE 404, is trained to map input images (e.g., images 402, which may be the same as or similar to the training images 325) to latent spaces while keeping the back-end GAN generator 418 fixed. In a second process detailed in FIG. 6, starting from a copy of the pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model), the pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) is fine-tuned such that a sample from a Gaussian distribution in the latent space can generate images that better fit the stylistic exemplars. In examples, the two training stages are executed independently and can be trained in parallel. However, structurally the two training stages share pivot latent spaces (Z+ 413 and W+ 417 described later in this specification), and are also jointly anchored by the fixed GAN generator 418. By separating inversion training and generation training into two stages as previously mentioned, the training does not require paired datasets; the separation of training also enables higher resolutions by reducing the computational load, making the backpropagation process more effective and efficient.
Thus, rather than fine-tuning the architecture of the AgileGAN model, new style domains can be incorporated by fine-tuning the generators.

The pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) is equipped with two latent spaces: the original latent space Z 412 under a Gaussian distribution, and a less entangled W space 416, which is mapped from Z 412 through a Multi-Layer Perceptron (MLP) f 414. While the original GAN generation (e.g., StyleGAN2) is conducted in a coarse-to-fine manner using several disentangled layers but with the same latent code input to each layer, to enlarge the AgileGAN model's expressiveness, a different latent code is input for each disentangled layer of the AgileGAN model, allowing for individual control. This is equivalent to stacking multiple versions of the original latent space Z 412 to form a new space Z+ 413. Unlike most embedding methods that target single-domain image editing or pixel-level reconstruction by manipulating the W space 416, the Z+ space 413 is utilized, at least in part, because stylization uses cross-domain image generation. Cross-domain image generation makes it difficult to embed directly into the W space 416 without suffering deterioration in stylization quality, since not all codes in the W space 416 may be appropriate for stylization. Further, the W space 416 is covered by a complex non-Gaussian distribution; directly encoding images into the W space 416 via a network may not correspond appropriately to a Gaussian distribution in the Z+ space 413. Accordingly, as described herein, stylization is addressed via the Z+ space 413, as more constrained Gaussian modeling leads to better regularization across different styles.
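The relationship between Z, Z+, and a per-layer W code described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the disclosed implementation: `make_toy_mapping` is a hypothetical one-layer stand-in for the MLP f 414, the latent dimension is shrunk to 8 for readability, the count of 18 per-layer codes is taken from the 18×512 shape given later in this description, and it is assumed that each per-layer code is passed through the same mapping network.

```python
import math
import random

NUM_LAYERS = 18  # per-layer codes; matches the 18x512 shape given later in the text
LATENT_DIM = 8   # toy size for illustration (the text uses 512)


def sample_z(rng, dim=LATENT_DIM):
    """Draw one latent code from the standard Gaussian prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def make_toy_mapping(rng, dim=LATENT_DIM):
    """Return a hypothetical one-layer stand-in for the mapping network f: Z -> W."""
    scale = 1.0 / math.sqrt(dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def f(z):
        pre = [sum(w * v for w, v in zip(row, z)) for row in weights]
        return [p if p > 0.0 else 0.2 * p for p in pre]  # leaky ReLU activation

    return f


def repeat_shared_code(z, num_layers=NUM_LAYERS):
    """Original behaviour: the same code is fed to every generator layer."""
    return [list(z) for _ in range(num_layers)]


def sample_z_plus(rng, num_layers=NUM_LAYERS, dim=LATENT_DIM):
    """Z+ behaviour: an independent Gaussian code for each generator layer."""
    return [sample_z(rng, dim) for _ in range(num_layers)]


def map_per_layer(z_plus, f):
    """Apply the shared mapping f to each layer's code (assumed Z+ -> W+ route)."""
    return [f(z) for z in z_plus]


rng = random.Random(0)
f = make_toy_mapping(rng)
shared = repeat_shared_code(sample_z(rng))  # 18 identical rows
z_plus = sample_z_plus(rng)                 # 18 independent rows
w_plus = map_per_layer(z_plus, f)
```

Because every row of `z_plus` is still drawn from N(0, I), the enlarged space keeps the Gaussian prior that the passage above relies on for regularization, while each layer can now be controlled individually.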

Traditional autoencoders generally lack the ability to generate new images because the resulting latent space is discontinuous. To force the autoencoder to generate a continuous latent space, an output vector of means 406 and an output vector of standard deviations 408 are utilized. Training the hierarchical variational encoder 404 includes optimizing for Kullback-Leibler divergence 410 (e.g., a mean close to 0 and a standard deviation close to 1) in addition to image reconstruction and other losses which may rely on the means 406 and standard deviations 408. A latent vector z corresponding to an input image of the plurality of input images 402 may be sampled using the mean 406 and the standard deviation 408. While a typical variational autoencoder includes an encoder ε_(θ) and a decoder G_(ϕ) (e.g., the GAN generator 418) with respective parameters θ and ϕ, which are trained jointly to minimize reconstruction error between an input image x (e.g., an image of the plurality of training images 402) and the corresponding output image (e.g., an image generated by the GAN generator 418), the hVAE 404 for inversion uses a fixed original pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) as the decoder G_(ϕ_o) (e.g., GAN generator 418), and the hVAE 404 is trained to learn the posterior distribution q(z|x). The encoding parameters θ may be trained using the stochastic gradient variational Bayes (SGVB) algorithm to solve:

$\theta^{*} = \underset{\theta}{\arg\min}\ \mathbb{E}_{z \sim \varepsilon_{\theta}(x)}\left[ -\log p(x \mid z) \right] + D_{kl}\left( \varepsilon_{\theta}(x) \,\|\, p(z) \right)$

where D_(kl) denotes the Kullback-Leibler (hereinafter KL) divergence. The posterior (importance) distribution, mapped by the variational autoencoder 404 from x, is modeled as a multivariate Gaussian distribution q(z|x)=ε_(θ)(x)=N(z_(μ), diag(z_(σ)²)), where z_(μ), z_(σ) ∈ ℝ^(18×512) are the multi-dimensional outputs of ε_(θ)(x), representing the mean and standard deviation, respectively, with the covariance in diagonal matrix form. The prior is p(z)=N(0, I), as used in StyleGAN2, and thus the KL divergence can be expressed in the analytic form of:

$D_{kl}\left( \varepsilon_{\theta}(x) \,\|\, N(0, I) \right) = -\frac{1}{2} \sum_{i} \left( 1 + 2\log z_{\sigma,i} - z_{\mu,i}^{2} - z_{\sigma,i}^{2} \right),$

where the summation applies across all dimensions of z_(σ) and z_(μ). Sampling is kept differentiable for backpropagation via the reparameterization trick, whereby z can be sampled according to:

$z = z_{\mu} + \epsilon \otimes z_{\sigma}, \quad \epsilon \sim N(0, I),$

where ⊗ is an element-wise multiplication operator.
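
The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the disclosed implementation: it uses the standard closed form of the KL divergence between a diagonal Gaussian and N(0, I), which is non-negative and equals zero when the mean is 0 and the standard deviation is 1. The function names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(z_mu, z_sigma):
    """KL( N(z_mu, diag(z_sigma^2)) || N(0, I) ), summed over all dimensions."""
    return -0.5 * np.sum(1.0 + 2.0 * np.log(z_sigma) - z_mu**2 - z_sigma**2)


def reparameterize(z_mu, z_sigma, rng):
    """Sample z = z_mu + eps * z_sigma with eps ~ N(0, I), element-wise."""
    eps = rng.standard_normal(z_mu.shape)
    return z_mu + eps * z_sigma


rng = np.random.default_rng(0)
z_mu = np.zeros((18, 512))
z_sigma = np.ones((18, 512))
z = reparameterize(z_mu, z_sigma, rng)   # one latent sample in Z+
kl = kl_to_standard_normal(z_mu, z_sigma)  # zero for a standard Gaussian
```

Because the randomness enters only through `eps`, gradients with respect to `z_mu` and `z_sigma` remain well defined in an automatic differentiation framework.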

Multiple loss functions are used in training the hVAE 404 (e.g., ε_(θ)). An L₂ loss for reconstruction can be generated as follows:

$\mathcal{L}_{rec} = \mathcal{L}_{2}\left( x, G_{\phi_{o}}(\varepsilon_{\theta}(x)) \right)$

This measures the pixel-level differences between the input image x and the generated output G_(ϕ_o)(ε_(θ)(x)). In addition, the LPIPS loss is used to learn perceptual-level similarities:

$\mathcal{L}_{per} = \mathcal{L}_{lpips}\left( x, G_{\phi_{o}}(\varepsilon_{\theta}(x)) \right)$

To preserve identity, the facial recognition loss is used as follows:

$\mathcal{L}_{id} = \mathcal{L}_{arc}\left( x, G_{\phi_{o}}(\varepsilon_{\theta}(x)) \right)$

where ℒ_(arc) is based on the cosine similarity between intermediate features extracted from a pre-trained ArcFace recognition network for the source and output images. The KL divergence loss is defined as:

$\mathcal{L}_{kl} = D_{kl}\left( \varepsilon_{\theta}(x) \,\|\, N(0, I) \right)$

In combination, the total loss becomes:

$\mathcal{L} = \mathcal{L}_{rec} + w_{per}\mathcal{L}_{per} + w_{id}\mathcal{L}_{id} + w_{kl}\mathcal{L}_{kl}$

where w_(per), w_(id), w_(kl) are the weights of the perceptual loss, identity loss, and KL divergence loss, respectively, relative to the reconstruction loss.
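
The weighted combination above can be sketched as follows. This is an illustrative sketch under stated assumptions: the identity term is written as one minus cosine similarity (the text says only that it is based on cosine similarity), the perceptual and KL terms are assumed to be computed elsewhere, and the weight values in the usage line are placeholders, since none are given here.

```python
import numpy as np


def l2_loss(x, x_hat):
    """Mean squared pixel difference between input and reconstruction."""
    return float(np.mean((x - x_hat) ** 2))


def cosine_identity_loss(feat_x, feat_x_hat):
    """Assumed form: 1 - cosine similarity of two recognition feature vectors."""
    cos = np.dot(feat_x, feat_x_hat) / (
        np.linalg.norm(feat_x) * np.linalg.norm(feat_x_hat)
    )
    return float(1.0 - cos)


def total_encoder_loss(l_rec, l_per, l_id, l_kl, w_per, w_id, w_kl):
    """L = L_rec + w_per * L_per + w_id * L_id + w_kl * L_kl."""
    return l_rec + w_per * l_per + w_id * l_id + w_kl * l_kl


# Placeholder loss values and weights, for illustration only.
example_total = total_encoder_loss(1.0, 2.0, 3.0, 4.0, w_per=0.5, w_id=0.1, w_kl=0.01)
```

The reconstruction term carries an implicit weight of 1, so the three weights set the influence of the other terms relative to it.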

Using a GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) as the base, the intermediate style codes mapped from Z+ are injected into different layers of the StyleGAN2 generator 418 and can semantically control image generation. The style codes broadly fall into three groups: 1) style codes lying in lower layers control coarser attributes like facial shapes, 2) middle layer codes control more localized facial features, while 3) high layer codes correspond to fine details such as reflectance and texture. One straightforward way to embed an input image is to directly estimate the combined 18×512 latent code z in Z+ from a fully connected layer. However, it turns out to be difficult to effectively train such a network.

To address this issue, a hierarchy of a pyramid network is used to capture various levels of detail from different layers. FIG. 5 depicts additional details of the hierarchical variational autoencoder 500 in accordance with examples of the present disclosure. As depicted in FIG. 5, an input image from the plurality of training images 501 at an example resolution of 256×256 is passed through a headless pyramid network 502 to produce multiple levels of feature maps at different sizes. In examples, the multiple levels of feature maps at different sizes correspond to coarse, medium, and fine details. Of course, additional levels and sizes of feature maps are contemplated. Each level's feature map is provided to a separate sub-encoder block 504, 506, 508 to produce a 6×512 code 512. A combined 18×512 code 512 can be passed to the fully connected layers (e.g., FC) to generate the means 514 and standard deviations 516 representing the Gaussian importance distribution in Z+. The hierarchical variational autoencoder 500 may be the same as or similar to the hVAE 404 (FIG. 4) and 318 (FIG. 3). The plurality of training images 501 may be the same as or similar to the plurality of training images 402 (FIG. 4) and 325 (FIG. 3).
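
The data flow described above (three feature levels, three 6×512 codes, one combined 18×512 code, then means and standard deviations) can be traced with a shape-level NumPy sketch. Everything beyond those shapes is an assumption made for illustration: the feature-map sizes, the channel count, the pooling-plus-projection sub-encoder, and the choice to predict a log standard deviation and exponentiate it.

```python
import numpy as np

CHANNELS = 8  # assumed channel count, kept small for illustration


def sub_encoder(feature_map, weight):
    """Hypothetical sub-encoder: average-pool a (C, H, W) map, project to 6x512."""
    pooled = feature_map.mean(axis=(1, 2))        # shape (C,)
    return (pooled @ weight).reshape(6, 512)      # weight has shape (C, 6 * 512)


def encode(feature_maps, sub_weights, fc_mu, fc_log_sigma):
    """Combine three 6x512 codes into 18x512, then produce means and std devs."""
    codes = [sub_encoder(f, w) for f, w in zip(feature_maps, sub_weights)]
    combined = np.concatenate(codes, axis=0)      # shape (18, 512)
    z_mu = combined @ fc_mu
    z_sigma = np.exp(combined @ fc_log_sigma)     # exp keeps std devs positive
    return z_mu, z_sigma


rng = np.random.default_rng(0)
# Assumed coarse, medium, and fine feature maps from the pyramid network.
feature_maps = [rng.standard_normal((CHANNELS, s, s)) for s in (16, 32, 64)]
sub_weights = [rng.standard_normal((CHANNELS, 6 * 512)) * 0.1 for _ in range(3)]
fc_mu = rng.standard_normal((512, 512)) * 0.01
fc_log_sigma = rng.standard_normal((512, 512)) * 0.01
z_mu, z_sigma = encode(feature_maps, sub_weights, fc_mu, fc_log_sigma)
```

Splitting the estimate across levels means each sub-encoder only has to predict the six layer codes that its level of detail influences, rather than all eighteen at once.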

FIG. 6 depicts details of an attribute-aware generator 600 in accordance with examples of the present disclosure. The attribute-aware generator 600 is based on a StyleGAN2 generator (e.g., StyleGAN2 generator 320 (FIG. 3)), but enhanced with a multi-path structure to better adapt to different features corresponding to known attributes, such as gender. Typically, when artists design characters, they often emphasize attribute-dependent characteristics to enhance appearance. Those attribute-dependent characteristics usually involve different facial geometric ratios as well as different facial features. Directly using the existing single-path StyleGAN2 structure and a single discriminator may not be best at distinguishing these attribute-dependent characteristics, while training several single-path generators to cater to different attributes will increase time and memory. For efficiency, a multi-path structure may be embedded within a same attribute-aware generator G_(ϕ_t)={G_(ϕ_t)^(k)}, k∈𝔸, corresponding to the different attributes 𝔸, while using multiple discriminators D={D_(k)}, k∈𝔸. The attribute-aware generator 600 depicts a first path 602 and a second path 604. Of course, more than two paths are contemplated. Since lower layers of the network guide coarse-level features like facial shapes, while higher layers affect facial reflectance and textures, the multi-path structure is more appropriately embedded within the lower layers. Nonetheless, this structure can also be placed into the higher layers in situations where it may be more appropriate. Other known attributes include, but are not limited to, hair color, hair length, glasses/no glasses, emotion, lighting, pose, etc.

FIG. 7 depicts additional details directed to training the attribute-aware generator 714 in accordance with examples of the present disclosure. As previously mentioned, to mitigate the small dataset problem and better preserve user identity, transfer learning and an early stopping strategy are used to train the attribute-aware generator 714. Each latent code z 706, sampled from a standard Gaussian distribution, is first mapped to an intermediate code w 710 via the multi-layer perceptron 708. Each intermediate code w 710 is forwarded into an affine transform in a style block of the attribute-aware generator 714 and therefore controls the image generation via adaptive instance normalization (AdaIN). When decoding, a constant feature map is first initialized by the attribute-aware generator 714. Multiple paths (e.g., 602, 604 from FIG. 6) are used in the lower layers for attribute specificity, while shared high layers unify texture appearance. Multiple attribute-specific discriminators (e.g., discriminator D in FIG. 6) are used to evaluate a quality of the generated images.
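
For readers unfamiliar with adaptive instance normalization, the following is a minimal NumPy sketch of the operation named above: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted by per-channel style values. In this sketch the scale and bias are passed in directly; in the described generator they would come from the affine transform of the intermediate code w, which is not modeled here.

```python
import numpy as np


def adain(features, style_scale, style_bias, eps=1e-5):
    """Adaptive instance normalization for one (C, H, W) feature tensor.

    style_scale and style_bias have shape (C,) and are assumed to be the
    output of an affine transform of the intermediate code w.
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))
style_scale = np.array([1.0, 2.0, 3.0, 4.0])
style_bias = np.array([0.0, 1.0, -1.0, 0.5])
styled = adain(features, style_scale, style_bias)
```

After the call, each channel of `styled` has approximately the mean and standard deviation dictated by the style values, regardless of the statistics of the incoming features.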

Transfer learning is used to train the attribute-aware generator 714. As artistic portraits share obvious perceptual correspondences to real portraits, AgileGAN relies on the GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model), pre-trained on a dataset, as the initialization weights. The attribute-aware generator 714 is subsequently fine-tuned on the smaller stylized dataset (e.g., plurality of exemplar images 702) using transfer learning from the pre-trained GAN generator 712 (e.g., StyleGAN-based generator and/or StyleGAN2 generator). Benefits of using StyleGAN2 for stylization include but are not limited to: 1) fine-tuning can significantly reduce the training data and time needed for high quality generation, compared to training from scratch, 2) StyleGAN2's coarse-to-fine generation architecture can support various artistic styles, including geometric and appearance stylization, and 3) the fine-tuned generator G_(ϕ_t)(z), which is derived from the original model G_(ϕ_o)(z), can form a natural correspondence with the original model when given the same latent codes, even with different generator parameters ϕ. Therefore, once trained, when given an input image x, the inverse mapped latent code z 706 can first be obtained from an hVAE and passed to different stylized generators 714 (trained on different stylized datasets). This results in different stylized images, i.e., {G_(ϕ_1)(ε_(θ)(x)), G_(ϕ_2)(ε_(θ)(x)), G_(ϕ_3)(ε_(θ)(x)), . . . }.

During the fine-tuning process of the attribute-aware generator G_(ϕ_t) 714, four loss functions are considered. An adversarial loss function is used to match the distribution of the translated images to the target domain distribution:

$\mathcal{L}_{adv} = \sum_{k \in \mathbb{A}} \left( \mathbb{E}_{y_{k}}\left[ \min\left(0, -1 + D_{k}(y_{k})\right) \right] + \mathbb{E}_{z \sim N(0, I)}\left[ \min\left(0, -1 - D_{k}\left(G_{\phi_{t}}^{k}(z)\right)\right) \right] \right)$

where y_(k) are target style images, classified by attribute k. To preserve the recognizable identity of the generated image, a similarity loss at perceptual level is introduced, given by a modified LPIPS loss. Specifically, differences from the first 9 layers of the VGG16-based LPIPS are discarded and the remaining differences from higher level layers are used. This helps in capturing the facial structural similarity, while ignoring local appearance variation.

$\mathcal{L}_{sim} = \sum_{k \in \mathbb{A}} \sum_{i=9}^{30} \mathcal{L}_{lpips}^{i}\left( G_{\phi_{t}}^{k}(z), G_{\phi_{o}}(z) \right)$

To help improve training stability and prevent artifact formations, regularizing terms are employed. For discriminators, R1 regularization may be used.

$\mathcal{L}_{R1} = \frac{\gamma}{2} \sum_{k \in \mathbb{A}} \mathbb{E}_{y_{k}}\left[ \left\| \nabla D_{k}(y_{k}) \right\|^{2} \right]$

where γ=10 is the hyper-parameter for gradient regularization. For the StyleGAN2 generator 712, a standard perceptual path-length regularization ℒ_(path) from StyleGAN2 712 is used to aid reliability and behavior consistency in generative models.

The generator and discriminators of the pre-trained StyleGAN model are jointly trained to optimize the combined objective of:

$\min_{\phi} \max_{D}\ \mathcal{L}_{adv} + w_{sim}\mathcal{L}_{sim} + w_{R1}\mathcal{L}_{R1} + w_{path}\mathcal{L}_{path}$

where w_(sim)=0.5, w_(R1)=5, w_(path)=2 are the weights of the similarity loss, the R1 regularization loss, and the path-length regularization loss, respectively, relative to the adversarial loss.
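
As a rough illustration of the hinge-style adversarial term and the weighted combination, the sketch below evaluates both from precomputed discriminator scores. It is a simplified sketch under stated assumptions: the discriminator outputs and the similarity, R1, and path-length values are taken as given numbers, the dictionary keyed by attribute is a hypothetical data layout, and the alternating min/max optimization itself is not shown. The default weights are the values stated above.

```python
import numpy as np


def hinge_adv_term(d_real, d_fake):
    """E[min(0, -1 + D(y))] + E[min(0, -1 - D(G(z)))] for one attribute path."""
    real_part = np.mean(np.minimum(0.0, -1.0 + d_real))
    fake_part = np.mean(np.minimum(0.0, -1.0 - d_fake))
    return float(real_part + fake_part)


def adversarial_loss(scores_by_attribute):
    """Sum the hinge term over all attribute paths k."""
    return sum(hinge_adv_term(r, f) for r, f in scores_by_attribute.values())


def combined_objective(l_adv, l_sim, l_r1, l_path, w_sim=0.5, w_r1=5.0, w_path=2.0):
    """L_adv + w_sim * L_sim + w_R1 * L_R1 + w_path * L_path."""
    return l_adv + w_sim * l_sim + w_r1 * l_r1 + w_path * l_path


# Hypothetical discriminator scores for two attribute paths.
scores = {
    "path_a": (np.array([2.0, 0.0]), np.array([-2.0, 0.0])),
    "path_b": (np.array([1.5, 3.0]), np.array([-1.0, -4.0])),
}
l_adv = adversarial_loss(scores)
objective = combined_objective(l_adv, l_sim=0.2, l_r1=0.1, l_path=0.3)
```

The hinge term is at most zero and reaches zero only when real scores are at least 1 and generated scores are at most -1, which is why the discriminators maximize it while the generator pushes it down.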

A potential issue with small datasets is that the discriminator of the pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) may overfit the training examples, causing instability and degradation in GAN training. To mitigate this issue, an early stopping strategy is adopted to stop training once a desired stylization effect has been achieved. Increasing the number of iterations may lead to an increased deviation from the original input expression. Thus, to strike a balance between input fidelity and stylistic fit, training can be stopped early (e.g., after 1200 iterations).

FIG. 8 depicts details directed to the inference process of the trained AgileGAN model 802 in accordance with examples of the present disclosure. The trained AgileGAN model 802 may be the same as or similar to the trained AgileGAN model 336 (FIG. 3). More specifically, given an input face image 804, the input image 804 is preprocessed at the preprocessor 806 to warp and normalize the input image to a 256×256 resolution based on its landmarks. The processed image is then encoded by the trained hierarchical variational autoencoder 808 to obtain the latent Gaussian posterior distribution q(z|x). The trained hierarchical variational autoencoder 808 may be the same as or similar to the hVAE 338 (FIG. 3). Since this posterior/importance distribution is relevant primarily during the training of the hierarchical variational autoencoder 808, a sample from this distribution is not used during inference; instead, the distribution mean is used as the latent code z 810, which better maintains temporal consistency. This z code 810 is then mapped to the w code 814 via the multi-layer perceptron 812, and then passed to a chosen stylized generator, such as the trained attribute-aware generator 816 previously trained using exemplar images, to generate a stylized image 818. Though a variety of resolutions are possible, the stylized image 818 may be in a 1024×1024 resolution. In some cases, there may be high frequency artifacts generated by the attribute-aware generator 816. In these cases, multiple instances may be sampled from the imputed Gaussian distribution (e.g., z space 810), leading to multiple output images 818. An output image 818 without artifacts can be selected, either manually or by selecting the output image 818 having the smallest average perceptual distance to the other output images. To account for some attributes, an external pre-trained corresponding attribute detector network may be used to select one or more of the output images 818 best embodying the desired attribute(s). In total, the inference stage may take less than 130 ms per image.
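
The automatic selection rule mentioned above (pick the output with the smallest average perceptual distance to the other outputs) can be sketched as follows. The sketch assumes the pairwise perceptual distances have already been computed and stored in a symmetric matrix with a zero diagonal; how those distances are obtained is outside its scope, and the function name is hypothetical.

```python
import numpy as np


def select_most_typical(distances):
    """Return the index of the output with the smallest mean distance to the others.

    distances is an (n, n) symmetric matrix of pairwise perceptual distances
    with zeros on the diagonal; n must be at least 2.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    if n < 2 or d.shape != (n, n):
        raise ValueError("expected a square matrix for at least two outputs")
    # The diagonal is zero, so dividing the row sum by (n - 1) averages
    # over the other outputs only.
    average_distance = d.sum(axis=1) / (n - 1)
    return int(np.argmin(average_distance))


# Hypothetical distances among three sampled outputs; the third is an outlier.
pairwise = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]
best_index = select_most_typical(pairwise)
```

An output containing artifacts tends to sit far from the rest of the samples, so choosing the most central output is a simple way to avoid it without manual review.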

FIG. 9 depicts details of a method 900 for training an AgileGAN model in accordance with examples of the present disclosure. A general order for the steps of the method 900 is shown in FIG. 9. Generally, the method 900 starts at 902 and ends at 936. The method 900 may include more or fewer steps or may arrange the order of the steps differently than those shown in FIG. 9. The method 900 can be executed as a set of computer-executable instructions executed by a computer system and encoded or stored on a computer readable medium. In examples, aspects of the method 900 are performed by one or more processing devices, such as a computer or server. Further, the method 900 can be performed by gates or circuits associated with a processor, Application Specific Integrated Circuit (ASIC), a field programmable gate array (FPGA), a system on chip (SOC), a neural processing unit, or other hardware device. Hereinafter, the method 900 shall be explained with reference to the systems, components, modules, software, data structures, user interfaces, etc. described in conjunction with FIGS. 1-8.

The method starts at 902, where flow may proceed to one or both of 904 and 928. At 904, a plurality of training images is received. The plurality of training images may be the same as or similar to the plurality of training images 325 (FIG. 3) and/or 402 (FIG. 4) and may be different from a plurality of images used to train an initial GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model). From 904, the method 900 may proceed to 906, where an hVAE is trained using the plurality of received training images and a pre-trained GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model). More specifically, the training images may be preprocessed at 908 to an input image resolution of 256×256, for example, and then passed through a headless pyramid network at 910 to produce multiple levels of feature maps at different sizes. For example, three levels of feature maps corresponding to coarse, medium, and fine details may be obtained. At 912, each level's feature map then goes through a separate sub-encoder block to produce a code, such as a 6×512 code. The combined code from each of the levels (e.g., an 18×512 code) is passed to the fully connected layers at 914 to generate the means and standard deviations at 916 representing the Gaussian importance distribution in Z+ space. A latent vector z may be sampled from the Z+ space at 918 and mapped to w in a W+ space at 920 via a multi-layer perceptron. The w vector may be provided to a pre-trained StyleGAN2 generator to reconstruct an image based on the latent vector z to obtain an output image at 922. The differences between the output image and the input image can be used to update the weights associated with the hVAE at 924. Once trained, the output of the hVAE may be provided to a trained attribute-aware generator.

In examples where the method 900 proceeds to 928, a plurality of exemplar images are received. The plurality of exemplar images may be the same as or similar to the plurality of exemplar images 206 (FIG. 2), 326 (FIG. 3), and/or 702 (FIG. 7). The method 900 may proceed to fine-tune an attribute-aware generator at 930. More specifically, the exemplar images may first be preprocessed at 932 by extracting landmarks, conducting normalization by aligning position (such as eye position), and cropping to a specific input size (e.g., 1024×1024). At 934, the processed exemplar images are used to train an attribute-aware generator using a GAN model (e.g., StyleGAN-based model and/or StyleGAN2 model) pre-trained on real portrait datasets as the initialization weights for the generator and the discriminators. Using transfer learning, the weights are fine-tuned with the exemplar images. The method 900 may end at 936.

FIG. 10 depicts details of a method 1000 for generating a stylized image from an input image in accordance with examples of the present disclosure. A general order for the steps of the method 1000 is shown in FIG. 10. Generally, the method 1000 starts at 1002 and ends at 1016. The method 1000 may include more or fewer steps or may arrange the order of the steps differently than those shown in FIG. 10. The method 1000 can be executed as a set of computer-executable instructions executed by a computer system and encoded or stored on a computer readable medium. In examples, aspects of the method 1000 are performed by one or more processing devices, such as a computer or server. Further, the method 1000 can be performed by gates or circuits associated with a processor, Application Specific Integrated Circuit (ASIC), a field programmable gate array (FPGA), a system on chip (SOC), a neural processing unit, or other hardware device. Hereinafter, the method 1000 shall be explained with reference to the systems, components, modules, software, data structures, user interfaces, etc. described in conjunction with FIGS. 1-9.

The method starts at 1002, where flow may proceed to 1004. At 1004, an image to be stylized is received. For example, an input image that is the same as or similar to the input image 212 may be received by an AgileGAN model. The method 1000 may proceed to preprocess the received image at 1006. At 1008, an inversion process may occur in which the preprocessed image is encoded by an hVAE, trained by the method 900 for example, to obtain the posterior distribution, the mean of which is used as the latent code z. At 1010, the latent code z is mapped to the w code and then passed to a chosen stylized generator to generate a stylized image at 1012. The stylized image may then be output to and displayed at a display device at 1014. The method 1000 may end at 1016.

FIGS. 11-13 and the associated descriptions provide a discussion of a variety of operating environments in which aspects of the disclosure may be practiced. However, the devices and systems illustrated and discussed with respect to FIGS. 11-13 are for purposes of example and illustration and are not limiting of a vast number of computing device configurations that may be utilized for practicing aspects of the disclosure described herein.

FIG. 11 is a block diagram illustrating physical components (e.g., hardware) of a computing system 1100 with which aspects of the disclosure may be practiced. The computing device components described below may be suitable for the computing and/or processing devices described above. In a basic configuration, the computing system 1100 may include at least one processing unit 1102 and a system memory 1104. Depending on the configuration and type of computing device, the system memory 1104 may comprise, but is not limited to, volatile storage (e.g., random-access memory (RAM)), non-volatile storage (e.g., read-only memory (ROM)), flash memory, or any combination of such memories.

The system memory 1104 may include an operating system 1105 and one or more program modules 1106 suitable for running software application 1120, such as one or more components supported by the systems described herein. As examples, system memory 1104 may include the image acquisition manager 1121, the AgileGAN training framework 1122, and the trained AgileGAN model 1123. The image acquisition manager 1121 may be the same as or similar to the image acquisition manager 316 previously described. The AgileGAN training framework 1122 may be the same as or similar to the AgileGAN training framework 317 previously described. The trained AgileGAN model 1123 may be the same as or similar to the trained AgileGAN model 336 previously described. The operating system 1105, for example, may be suitable for controlling the operation of the computing system 1100.

Furthermore, embodiments of the disclosure may be practiced in conjunction with a graphics library, other operating systems, or any other application program and are not limited to any particular application or system. This basic configuration is illustrated in FIG. 11 by those components within a dashed line 1108. The computing system 1100 may have additional features or functionality. For example, the computing system 1100 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 11 by a removable storage device 1109 and a non-removable storage device 1110.

As stated above, a number of program modules and data files may be stored in the system memory 1104. While executing on the processing unit 1102, the program modules 1106 (e.g., software applications 1120) may perform processes including, but not limited to, the aspects as described herein. Other program modules that may be used in accordance with aspects of the present disclosure may include electronic mail and contacts applications, word processing applications, spreadsheet applications, database applications, slide presentation applications, drawing or computer-aided programs, etc.

Furthermore, embodiments of the disclosure may be practiced in an electrical circuit comprising discrete electronic elements, packaged or integrated electronic chips containing logic gates, a circuit utilizing a microprocessor, or on a single chip containing electronic elements or microprocessors. For example, embodiments of the disclosure may be practiced via a system-on-a-chip (SOC) where each or many of the components illustrated in FIG. 11 may be integrated onto a single integrated circuit. Such an SOC device may include one or more processing units, graphics units, communications units, system virtualization units and various application functionality, all of which are integrated (or “burned”) onto the chip substrate as a single integrated circuit. When operating via an SOC, the functionality described herein, with respect to the capability of client to switch protocols, may be operated via application-specific logic integrated with other components of the computing system 1100 on the single integrated circuit (chip). Embodiments of the disclosure may also be practiced using other technologies capable of performing logical operations such as, for example, AND, OR, and NOT, including but not limited to mechanical, optical, fluidic, and quantum technologies. In addition, embodiments of the disclosure may be practiced within a general-purpose computer or in any other circuits or systems.

The computing system 1100 may also have one or more input device(s) 1112 such as a keyboard, a mouse, a pen, a sound or voice input device, a touch or swipe input device, etc. The one or more input devices 1112 may include an image sensor. The image sensor may acquire an image and provide the image to the image acquisition manager 1121. Output device(s) 1114 such as a display, speakers, a printer, etc. may also be included. The aforementioned devices are examples and others may be used. The computing system 1100 may include one or more communication connections 1116 allowing communications with other computing devices/systems 1150. Examples of suitable communication connections 1116 include, but are not limited to, radio frequency (RF) transmitter, receiver, and/or transceiver circuitry; universal serial bus (USB), parallel, and/or serial ports.

The term computer readable media as used herein may include computer storage media. Computer storage media may include volatile and nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, or program modules. The system memory 1104, the removable storage device 1109, and the non-removable storage device 1110 are all computer storage media examples (e.g., memory storage). Computer storage media may include RAM, ROM, electrically erasable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other article of manufacture which can be used to store information and which can be accessed by the computing system 1100. Any such computer storage media may be part of the computing system 1100. Computer storage media does not include a carrier wave or other propagated or modulated data signal.

Communication media may be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” may describe a signal that has one or more characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.

FIGS. 12A-12B illustrate a computing system 1200, for example, a mobile telephone, a smart phone, wearable computer (such as a smart watch), a tablet computer, a desktop computer, a laptop computer, and the like, with which examples of the disclosure may be practiced. With reference to FIG. 12A, one aspect of a computing system 1200 for implementing the aspects is illustrated. In a basic configuration, the computing system 1200 is a desktop computer having both input elements and output elements. The computing system 1200 typically includes a display 1205, which may also function as an input device (e.g., a touch screen display). The computing system 1200 may also include a keypad 1235. The keypad 1235 may be a physical keypad or a “soft” keypad generated on the touch screen display.

In various embodiments, the output elements include the display 1205 for showing a graphical user interface (GUI), a visual indicator 1220 (e.g., a light-emitting diode), and/or an audio transducer 1225 (e.g., a speaker). In yet another aspect, the computing system 1200 incorporates input and/or output ports, such as an audio input (e.g., a microphone jack), an audio output (e.g., a headphone jack), and a video output (e.g., an HDMI port) for sending signals to or receiving signals from an external device.

FIG. 12B is a block diagram illustrating the architecture of one aspect of a mobile computing system. That is, the computing system 1200 can incorporate a system (e.g., an architecture) 1202 to implement some aspects. In one embodiment, system 1202 is implemented as a “computing system” capable of running one or more applications (e.g., browser, e-mail, calendaring, contact managers, messaging clients, games, and media clients/players). In some aspects, system 1202 is integrated as a computing system, such as a desktop computer.

One or more application programs 1266 may be loaded into the memory 1262 and run on or in association with the operating system 1264. Examples of the application programs include phone dialer programs, e-mail programs, personal information management (PIM) programs, word processing programs, spreadsheet programs, Internet browser programs, messaging programs, maps programs, and so forth. System 1202 also includes a nonvolatile storage area 1268 within the memory 1262. The nonvolatile storage area 1268 may be used to store persistent information that should not be lost if the system 1202 is powered down. The application programs 1266 may use and store information in the nonvolatile storage area 1268, such as e-mail or other messages used by an e-mail application, and the like. A synchronization application (not shown) also resides on system 1202 and is programmed to interact with a corresponding synchronization application resident on a host computer to keep the information stored in the nonvolatile storage area 1268 synchronized with corresponding information stored at the host computer. As should be appreciated, other applications may be loaded into the memory 1262 and run on the computing system 1200 described herein (e.g., search engine, extractor module, etc.).

The system 1202 has a power supply 1270, which may be implemented as one or more batteries. The power supply 1270 might further include an external power source, such as an AC adapter or a powered docking cradle that supplements or recharges the batteries.

The system 1202 may also include a radio interface layer 1272 that performs the function of transmitting and receiving radio frequency communications. The radio interface layer 1272 facilitates wireless connectivity between the system 1202 and the “outside world” via a communications carrier or service provider. Transmissions to and from the radio interface layer 1272 are conducted under the control of the operating system 1264. In other words, communications received by the radio interface layer 1272 may be disseminated to the application programs 1266 via the operating system 1264, and vice versa.

The system 1202 may further include a video interface 1276 that enables an operation of an on-board camera 1230 to record still images, video streams, and the like. A computing system 1200 implementing the system 1202 may have additional features or functionality. For example, the computing system 1200 may also include additional data storage devices (removable and/or non-removable) such as magnetic disks, optical disks, or tape. Such additional storage is illustrated in FIG. 12B by the nonvolatile storage area 1268.

Data/information generated or captured by the computing system 1200 and stored via the system 1202 may be stored locally on the computing system 1200, as described above, or the data may be stored on any number of storage media that may be accessed by the device via the radio interface layer 1272 or via a wired connection between the computing system 1200 and a separate computing system associated with the computing system 1200, for example, a server computer in a distributed computing network, such as the Internet. As should be appreciated, such data/information may be accessed via the computing system 1200 via the radio interface layer 1272 or via a distributed computing network. Similarly, such data/information may be readily transferred between computing systems for storage and use according to well-known data/information transfer and storage means, including electronic mail and collaborative data/information sharing systems.

FIG. 13 illustrates one aspect of the architecture of a system for processing data received at a computing system from a remote source, such as a personal computer 1304, tablet computing device 1306, or mobile computing device 1308, as described above. The personal computer 1304, tablet computing device 1306, or mobile computing device 1308 may include one or more applications 1320; such applications may include but are not limited to the image acquisition manager 1321, the AgileGAN training framework 1322, and the trained AgileGAN model 1323. Content at a server device 1302 may be stored in different communication channels or other storage types. For example, various documents may be stored using a directory service, a web portal, a stylized image service, an instant messaging store, or social networking services.

One or more of the previously described program modules 1106 or software applications 1120 may be employed by the server device 1302 and/or the personal computer 1304, tablet computing device 1306, or mobile computing device 1308, as described above. For example, the server device 1302 may include the image acquisition manager 1321, the AgileGAN training framework 1322, and the trained AgileGAN model 1323. The image acquisition manager 1321 may be the same as or similar to the image acquisition manager 316 and 1121 previously described. The AgileGAN training framework 1322 may be the same as or similar to the AgileGAN training framework 317 and 1122 previously described. The trained AgileGAN model 1323 may be the same as or similar to the trained AgileGAN model 336 and 1123 previously described.

The server device 1302 may provide data to and from a client computing device such as a personal computer 1304, a tablet computing device 1306 and/or a mobile computing device 1308 (e.g., a smart phone) through a network 1315. By way of example, the computer system described above may be embodied in a personal computer 1304, a tablet computing device 1306 and/or a mobile computing device 1308 (e.g., a smart phone). Any of these examples of the computing devices may obtain content from the store 1316, in addition to receiving graphical data useable to be either pre-processed at a graphic-originating system, or post-processed at a receiving computing system.

In addition, the aspects and functionalities described herein may operate over distributed systems (e.g., cloud-based computing systems), where application functionality, memory, data storage and retrieval and various processing functions may be operated remotely from each other over a distributed computing network, such as the Internet or an intranet. User interfaces and information of various types may be displayed via on-board computing device displays or via remote display units associated with one or more computing devices. For example, user interfaces and information of various types may be displayed and interacted with on a wall surface onto which user interfaces and information of various types are projected. Interaction with the multitude of computing systems with which embodiments of the invention may be practiced includes keystroke entry, touch screen entry, voice or other audio entry, gesture entry where an associated computing device is equipped with detection (e.g., camera) functionality for capturing and interpreting user gestures for controlling the functionality of the computing device, and the like.

The present disclosure relates to systems and methods for generating a stylized image according to at least the examples provided in the sections below:

(A1) In one aspect, some examples include a method for generating a stylized image. The method may include receiving an input image; encoding the input image using a variational autoencoder to obtain a latent vector; providing the latent vector to a generative adversarial network (GAN) generator; generating, by the GAN generator, a stylized image; and providing the stylized image as an output.

(A2) In some examples of A1, the method includes receiving a plurality of exemplar images; training the GAN generator using transfer learning based on the received plurality of exemplar images; and terminating the process of training when the output of the GAN generator satisfies a predetermined condition at a first time.

(A3) In some examples of A1-A2, the method includes receiving a plurality of training images; and training the variational autoencoder while keeping the weights of the pre-trained GAN generator fixed.

(A4) In some examples of A1-A3, the latent vector is sampled from a standard Gaussian distribution.

(A5) In some examples of A1-A2, the method includes mapping the latent vector to an intermediate vector; and forwarding the intermediate vector to an affine transform within a style block of the GAN generator.

(A6) In some examples of A1-A5, the GAN generator includes a multi-path structure corresponding to two or more different attributes.

(A7) In some examples of A1-A6, the method includes passing the received input image through a headless pyramid network to produce multiple levels of feature maps at different sizes; and combining an encoding of each level's respective feature map to obtain the latent vector.

(A8) In some examples of A1-A7, the GAN generator comprises a StyleGAN2 generator.
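
The encoding and generation steps of examples A1 and A7 can be illustrated with a minimal NumPy sketch. Everything in it is a toy stand-in chosen for illustration and is not the disclosed implementation: average pooling replaces the pyramid network, each per-level encoder is a single hypothetical linear layer, the generator is a fixed random linear map rather than a StyleGAN2 generator, and all names and dimensions (`pyramid_features`, `encode`, `generate`, `CODE_DIM`, `LATENT_DIM`) are assumptions. The latent vector is drawn with the standard VAE reparameterization, z = mu + sigma * eps.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool(x, factor):
    """Downsample an (H, W, C) array by averaging factor x factor blocks."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def pyramid_features(image, factors=(4, 8, 16)):
    """Toy stand-in for the pyramid network: feature maps at several sizes."""
    return [avg_pool(image, f) for f in factors]

# Hypothetical sizes and randomly initialised weights (assumptions, not from the source).
CODE_DIM, LATENT_DIM = 8, 16
level_weights = [rng.normal(scale=0.1, size=(3, CODE_DIM)) for _ in range(3)]
w_mu = rng.normal(scale=0.1, size=(3 * CODE_DIM, LATENT_DIM))
w_logvar = rng.normal(scale=0.1, size=(3 * CODE_DIM, LATENT_DIM))
w_gen = rng.normal(scale=0.1, size=(LATENT_DIM, 8 * 8 * 3))

def encode(image):
    """Encode each pyramid level separately, combine the codes, sample a latent."""
    codes = [fmap.mean(axis=(0, 1)) @ w
             for fmap, w in zip(pyramid_features(image), level_weights)]
    combined = np.concatenate(codes)              # one code per level, joined
    mu = combined @ w_mu                          # predicted mean
    sigma = np.exp(0.5 * (combined @ w_logvar))   # predicted standard deviation
    z = mu + sigma * rng.standard_normal(LATENT_DIM)  # reparameterised sample
    return z, mu, sigma

def generate(z):
    """Toy stand-in for the GAN generator acting as the decoder."""
    return np.tanh(z @ w_gen).reshape(8, 8, 3)

image = rng.random((64, 64, 3))   # placeholder input image
z, mu, sigma = encode(image)
stylized = generate(z)
```

In this sketch the three pyramid levels have sizes 16x16, 8x8 and 4x4, and the latent vector is the only quantity passed from the encoder to the generator.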

In yet another aspect, some examples include a computing system including one or more processors and memory coupled to the one or more processors, the memory storing one or more instructions which, when executed by the one or more processors, cause the one or more processors to perform any of the methods described herein (e.g., A1-A8 described above).

In yet another aspect, some examples include a non-transitory computer-readable storage medium storing one or more programs for execution by one or more processors of a storage device, the one or more programs including instructions for performing any of the methods described herein (e.g., A1-A8 described above).
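
The training arrangement of example A3, in which the encoder is trained while the generator weights stay fixed, can be sketched in simplified form as follows. This is a deterministic toy under stated assumptions and not the disclosed training procedure: the encoder `E` and the frozen generator `G` are plain linear maps, the loss is reconstruction error only (a full variational autoencoder objective would also include a KL regularization term), and the data, dimensions and learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 64, 8, 4                       # samples, data dim, latent dim (assumed)
X = rng.standard_normal((n, d))          # placeholder training data
G = rng.normal(scale=0.5, size=(k, d))   # "pre-trained" generator, kept frozen
G_before = G.copy()
E = np.zeros((d, k))                     # encoder weights, the only trainable part

def loss(E):
    """Mean squared reconstruction error of decode(encode(x)) against x."""
    return np.mean(np.sum((X @ E @ G - X) ** 2, axis=1))

initial_loss = loss(E)
lr = 0.01
for _ in range(200):
    residual = X @ E @ G - X
    grad_E = (2.0 / n) * X.T @ residual @ G.T   # gradient with respect to E only
    E -= lr * grad_E                            # G is never updated
final_loss = loss(E)
```

Because only `E` receives gradient updates, the generator acts purely as a fixed decoder, which mirrors the role described for the pre-trained GAN generator in the abstract.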

Aspects of the present disclosure, for example, are described above with reference to block diagrams and/or operational illustrations of methods, systems, and computer program products according to aspects of the disclosure. The functions/acts noted in the blocks may occur out of the order as shown in any flowchart. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

The description and illustration of one or more aspects provided in this application are not intended to limit or restrict the scope of the disclosure as claimed in any way. The aspects, examples, and details provided in this application are considered sufficient to convey possession and enable others to make and use the best mode of claimed disclosure. The claimed disclosure should not be construed as being limited to any aspect, example, or detail provided in this application. Regardless of whether shown and described in combination or separately, the various features (both structural and methodological) are intended to be selectively included or omitted to produce an embodiment with a particular set of features. Having been provided with the description and illustration of the present application, one skilled in the art may envision variations, modifications, and alternate aspects falling within the spirit of the broader aspects of the general inventive concept embodied in this application that do not depart from the broader scope of the claimed disclosure.

What is claimed is:
 1. A method for generating a stylized image, the method comprising: receiving an input image; encoding the input image using a variational autoencoder to obtain a latent vector by: passing the received input image through a headless pyramid network to produce multiple levels of feature maps at different sizes; encoding, for each of the levels of feature maps at different sizes, each level's respective feature map at the different size with a separate encoder of a plurality of encoders to produce a code, and combining the encoded code of each level's respective feature map to obtain the latent vector; providing the latent vector to a pre-trained generative adversarial network (GAN) model; generating, by the pre-trained GAN model, a stylized image from the pre-trained GAN model, the generated stylized image being a cartoon style image of the input image; and providing the stylized image as an output, wherein the pre-trained GAN model includes a multi-path structure corresponding to two or more different attributes.
 2. The method of claim 1, further comprising: receiving a plurality of exemplar images; training a GAN model using transfer learning based on the received plurality of exemplar images; and terminating the process of training when the output of the GAN model satisfies a predetermined condition at a first time to produce the pre-trained GAN model.
 3. The method of claim 2, further comprising: receiving a plurality of training images; and training the variational autoencoder while keeping the weights of the pre-trained GAN model fixed.
 4. The method of claim 1, wherein the latent vector is sampled from a standard Gaussian distribution.
 5. The method of claim 4, further comprising: mapping the latent vector to an intermediate vector; and forwarding the intermediate vector to an affine transform within a style block of the pre-trained GAN model.
 6. The method of claim 1, wherein the pre-trained GAN model comprises a pre-trained StyleGAN2 model.
 7. A system configured to generate a stylized image, the system comprising: a processor; and memory including instructions, which when executed by the processor, cause the processor to: receive an input image; encode the input image using a variational autoencoder to obtain a latent vector by: passing the received input image through a headless pyramid network to produce multiple levels of feature maps at different sizes; encoding, for each of the levels of feature maps at different sizes, each level's respective feature map at the different size with a separate encoder of a plurality of encoders to produce a code, and combining the encoded code of each level's respective feature map to obtain the latent vector; provide the latent vector to a pre-trained generative adversarial network (GAN) model; generate, by the pre-trained GAN model, a stylized image from the pre-trained GAN model, the generated stylized image being a cartoon style image of the input image; and provide the stylized image as an output, wherein the pre-trained GAN model includes a multi-path structure corresponding to two or more different attributes.
 8. The system of claim 7, wherein the instructions, when executed by the processor, cause the processor to: receive a plurality of exemplar images; train the GAN model using transfer learning based on a pre-trained GAN model and the received plurality of exemplar images; and terminate the process of training when the output of the GAN model satisfies a predetermined condition at a first time to produce the pre-trained GAN model.
 9. The system of claim 8, wherein the instructions, when executed by the processor, cause the processor to: receive a plurality of training images; and train the variational autoencoder while keeping the weights of the pre-trained GAN model fixed.
 10. The system of claim 7, wherein the latent vector is sampled from a standard Gaussian distribution.
 11. The system of claim 10, wherein the instructions, when executed by the processor, cause the processor to: map the latent vector to an intermediate vector; and forward the intermediate vector to an affine transform within a style block of the pre-trained GAN model.
 12. The system of claim 7, wherein the pre-trained GAN model comprises a pre-trained StyleGAN2 model.
 13. A non-transitory computer-readable storage medium including instructions, which when executed by a processor, cause the processor to: receive an input image; encode the input image using a variational autoencoder to obtain a latent vector by: passing the received input image through a headless pyramid network to produce multiple levels of feature maps at different sizes; encoding, for each of the levels of feature maps at different sizes, each level's respective feature map at the different size with a separate encoder of a plurality of encoders to produce a code, and combining the encoded code of each level's respective feature map to obtain the latent vector; provide the latent vector to a pre-trained generative adversarial network (GAN) model; generate, by the pre-trained GAN model, a stylized image from the pre-trained GAN model, the generated stylized image being a cartoon style image of the input image; and provide the stylized image as an output, wherein the pre-trained GAN model includes a multi-path structure corresponding to two or more different attributes.
 14. The non-transitory computer-readable storage medium of claim 13, wherein the instructions, when executed by a processor, cause the processor to: map a latent vector sampled from a standard Gaussian distribution to an intermediate vector; and forward the intermediate vector to an affine transform within a style block of the pre-trained GAN model.
 15. The non-transitory computer-readable storage medium of claim 14, wherein the combined code from each level's respective feature map to obtain the latent vector is passed to fully connected layers to generate means and standard deviations representing a Gaussian importance distribution in a Z+ space.
 16. The non-transitory computer-readable storage medium of claim 13, wherein the instructions, when executed by a processor, cause the processor to: receive a plurality of exemplar images including cartoon characters; train a GAN model using transfer learning based on the received plurality of exemplar images; and terminate the process of training after at most 1200 interactions to produce the pre-trained GAN model.
 17. The non-transitory computer-readable storage medium of claim 13, wherein the pre-trained GAN model comprises a pre-trained StyleGAN2 model.